An automatic target recognition classifier is described that uses a set of
dedicated vector quantizers (VQs) in the wavelet domain. The background
pixels in each input image are properly clipped out by a set of aspect
windows. The extracted target area for each aspect window is then enlarged to
a fixed size, after which a wavelet decomposition is used to split this region
into several subbands. A dedicated VQ codebook is then generated for each
subband of a particular target class at a specific range of aspects. Thus, each
codebook consists of a set of feature templates that are iteratively adapted
to represent a particular subband of a given target class at a specific range
of aspects. These templates are then further trained by a modified learning
vector quantization (LVQ) algorithm that enhances their discriminatory
characteristics. Finally, a path selector was designed to speed up the
recognition process at the expense of a tolerable degradation in the
recognition rate.

1.  Introduction


1.1  Background 


Human beings are usually very good at recognizing different targets, even
in a relatively crowded and changing environment. However, human
performance deteriorates drastically in a low-visibility environment or after an
extended period of surveillance. Furthermore, certain working environments
are either inaccessible or too hazardous for human beings. To compensate
for such limitations of human operators, an accurate and versatile automatic
target recognition (ATR) system is needed. For example, an ATR system
in a battlefield could alert graveyard-shift watchmen with accurate
information about any approaching vehicle, so that appropriate responses could be
made in a timely fashion. Similarly, a robust ATR system could reduce the
workloads of fighter pilots or tank commanders significantly by suggesting
effective responses in real time. In the civilian sector, mission-specific ATR
systems have been constructed for a number of tasks, including autonomous
vehicle navigation, automobile manufacturing and inspection, and orchard
sprayer systems in agriculture [1].

Despite their diversity, all ATR applications require an efficient and
reliable target recognition method. Unfortunately, the development of such
a method is often hampered by the large number of target classes and
aspects, long viewing ranges, obscuration, high-clutter background,
different geographic and weather conditions, sensor noise, and variations caused
by translation, rotation, and scaling of the targets. Furthermore, a range
of factors (similarities between the signatures of different targets, limited
training and testing data, camouflaged targets, nonrepeatability of target
signatures, and the difficulty in using the contextual information whenever
it is available to the recognition system) make the recognition problem even
more challenging. ATR applications for military purposes are especially
susceptible to these challenges, because in comparison to civilian applications,
military applications tend to be operated under a wider range of hostile
conditions, and it is much harder to collect an adequate and suitable set of data for
training and testing purposes. Similar difficulties also occur in other
recognition tasks, such as human face recognition [2,3] and handwriting
recognition [4,5].

In this report, we use Comanche second-generation forward-looking infrared
(FLIR) image chips as our training and testing sets. These images were
collected at different sites (Ft. Hunter-Liggett, CA; Yuma, AZ; and Ft.
Grayling, MI), seasons (winter and summer), times of day (day and night),
and operational conditions of the target (hot and cold). These data are
assumed to have come from an "unfamiliar environment" (according to the
definition given by Dasarathy [6]), because the identification of the
training data might not be reliable (with a level of reliability that is not known
a priori). Owing to the inherent characteristics of a FLIR sensor, the
signatures of the targets within the scene are severely affected by rain, fog,
and foliage [7]. Fortunately, a number of FLIR image enhancement
techniques can be used to preprocess the FLIR input images before detection
and recognition. Lo [8] examined six of these techniques: threshold zonal
filtering, statistical differencing, unsharp masking, prototype automatic
target screener technique, constant variance, and histogram equalization. He
found that the variable threshold zonal filtering technique performed most
satisfactorily, followed by the prototype automatic target screener technique
and unsharp masking.

A complete ATR system may consist of several algorithmic components,
such as preprocessing, detection, segmentation, feature extraction,
classification, prioritization, tracking, and aim-point selection [9]. In this report,
we assume that the locations of the potential targets are determined a priori
by a high-performance target detection algorithm (a cuer) with a very low
false-alarm rate. An example of such a detection algorithm is the ATR
relational template matching (ARTM) algorithm proposed by Kramer et al [10].
The boxes in figure 1 indicate the potential target areas (target chips) that
were detected by the ARTM algorithm. Our focus is to correctly classify
each target in these locations into one of the known target types, which
essentially is a target classification or recognition task. Therefore, the actual
inputs to our classifier are target chips that were extracted from the detected
regions and assumed to be clutter-free. Examples of fairly good quality
target chips for a truck at various viewing aspects are shown in figure 2. For a
given input image, the outputs of our classifier are the class likelihoods for
all target types considered. These outputs can then be used by later stages,
such as prioritization and aim-point selection.

Figure 1. An example FLIR image taken in a typical environment. Boxes indicate potential
target areas detected by the ARTM algorithm.

Figure 2. Selected target chips of a truck at various aspects (0°, 45°, 90°, 135°, 180°,
225°, 270°, and 315°).

Generally, a target recognition problem can be attempted with one or a
combination of the statistical, structural/syntactic, and neural network
approaches [11]. For example, the nearest neighbor and K-means
unsupervised classification methods are typical nonparametric statistical
techniques; K-means assumes that one can define K reasonable clusters for
a given data set by minimizing a distance measure between the data and
the centroids of the clusters [12]. A number of studies have been carried
out to improve the learning rule and performance of these clustering
techniques [13,14]. A supervised sibling of the K-means clustering technique is
the learning vector quantization (LVQ) algorithm [15,16]. In the LVQ
algorithm, a class identity reference vector is available for each training sample,
and the clustering is performed with respect to a distance measure along
with the class label. Besides these traditional recognition methods,
model-free neural network based methods are also gaining popularity because of
their learning capability and massively parallel implementation [17-23].

Recently, we proposed an ATR classifier that employed both the statistical
and neural network approaches [24,25]. In that classifier, a number of vector
quantizers (VQs) and multilayer perceptrons (MLPs) were modularly
cascaded to perform the target recognition task. The inputs to the VQs in that
classifier were image blocks extracted from the target chips in the spatial
domain. However, because of the high dimensionality of input images and the
scarcity of the training data, it is often necessary to further reduce the data
dimensionality by transforming the input data into a more compact feature
space before the classification process. For example, in a texture classification
task, McLean [26] first transformed the input image block into the spatial
frequency domain with a discrete cosine transform (DCT). With these local
spatial frequency features, he performed the transform vector quantization
for the combined purpose of texture coding and classification. Besides the
DCT, principal component analysis (PCA) [27] and the most discriminating
features (MDF) method [28] are among the other techniques that have been
used for dimensionality reduction in a target recognition task. In this report,
we reduce dimensionality using a wavelet decomposition process [29].

In many situations, it is quite beneficial to break up a complex
classification task into several smaller and easier subtasks. For instance, Anand et
al [30] used a modular network architecture to reduce a K-class
classification problem into K two-class problems, with a separately trained network
for each two-class problem. With this decomposition of task complexity,
they reported a faster convergence on the simpler modules and noted the
feasibility of parallel processing. As the computing hardware and software
have become more available for parallel processing, many researchers have
proposed massively parallel computing architectures that are specifically
optimized for image processing [31-33]. We incorporate modularity into our
classifier by constructing functionally similar processing paths and allowing
them to operate in parallel and independently from each other.

1.2  Research  Objectives

The goal of the proposed ATR classifier described in this report is to
recognize military targets in FLIR imagery. The schematic diagram in
figure 3 shows the four stages of our classifier: a set of aspect windows of
different sizes, a stage in which the extracted area is enlarged to a fixed size, a
stage for wavelet decomposition of the enlarged extraction, and a dedicated
VQ for each subband within each aspect window.

In the first stage, an aspect window is a background-clipping rectangle whose
size is determined by the type of target and the range of aspects that it
operates on. These aspect windows are needed for accurate extraction of the
target pixels from the input image, so that the irrelevant background pixels
that carry little information about the target are removed before further
processing occurs. After the background removal in the first stage, each
extracted target area is enlarged in the second stage into a fixed dimension
that is common to all the aspect windows. In the third stage, the enlarged
extraction is decomposed into four subbands based on a wavelet
decomposition process [34]. This decomposition process subdivides the complexity of
the recognition task and reduces the dimensionality of the VQ in the
following recognition stage. The generalization capability is also improved through
various manipulations of these orthogonal subbands. After the wavelet
decomposition, the final stage uses a set of VQ codebooks for feature
matching and target recognition. In this stage, the target recognition problem is
treated as a template matching task through a similarity-metric-based
approach. A set of subband templates or code vectors is constructed for each
subband of a particular target at a specific range of aspects. Each set of
code vectors forms a codebook, representing the target signatures for a given
subband of a particular target at a specific range of aspects.

Figure 3. Proposed automatic target recognition classifier. (Stages, in order: aspect
windows, extraction enlargement, wavelet decomposition, VQ codebooks.)

During the testing phase, each subband of the extracted target area is
represented by a similarity measure that compares the given subband with the
best-matching code vector from the corresponding codebook. A commonly
used similarity measure is the mean squared error (MSE). We can infer the
class of an input image by the MSEs obtained from comparing all the
subbands with the code vectors in the corresponding codebooks. The class of
the subband codebooks that produce the smallest overall MSE is expected
to be the class of the input image.

In section 2, we discuss in detail each component of the proposed ATR
classifier and the algorithms for training them. Experimental results are
presented in section 3. A brief comparison of our recognition results with
the performance of another compatible ATR classifier is provided in section
4. Finally, some conclusions are given in section 5.

2.  Wavelet-Based  ATR  Classifier


2.1  Aspect  Windows  and  Extraction  Enlargement 

Normally a target is surrounded by some background information that is
irrelevant for the correct recognition of the target. This background
information tends to create unwanted variations during the training phase,
which could later become problematic noise in the testing or the recognition
phase. Therefore, proper removal of the unwanted background is essential
for achieving good recognition performance.

The size and shape of target silhouettes often differ significantly at
different viewing aspects. For each target, the classifier uses several
rectangular windows of different sizes in order to remove the background
information. For example, by using the ground-truth silhouettes that are
generated by computer-aided design (CAD), the classifier can cluster the
silhouettes of each target into three different window categories representing the
front/rear, oblique, and side views of the target; thus, a total of eight
aspect windows is needed (one front view, one rear view, two side views, and
four oblique views), as shown in figure 4. Because the height of a
particular target does not change over the viewing aspects, the window clustering
process is based solely on the width of silhouettes. We use the K-means
algorithm to perform this clustering process. Since the size of the
silhouettes for the side views (around 90° and 270°) changes relatively slowly at
different viewing aspects, the classifier uses a broader range of aspects for
the side aspect windows (a range of 70°) than for the head and tail aspect
windows (a range of 40°). Assuming that the target in each image has been
shifted to the center of the image and scaled to a fixed range, the unwanted
background is properly clipped away by these aspect windows. An example
of this background clipping is illustrated in figure 5. In the proposed
algorithm, the segmentation of the target from the background and the target
classification are performed simultaneously. This process differs from many
existing ATR methods that separately perform target segmentation followed
by a classification of the segmented area. Obviously, such classifiers need a
correct target segmentation for a successful recognition by the subsequent
stages.

Figure 4. Partition of aspects into eight sets for a target chip.

Figure 5. Background clipping of several input images. (Top) Original images of a truck
viewed at 0°, 45°, and 90°, respectively. (Bottom) Corresponding extracted target areas
after proper background removal.
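
As a rough consistency check on the aspect partition, the range left for each oblique window can be worked out from the stated side (70°) and head/tail (40°) ranges. This assumes that the eight windows tile the full 360° without overlap and that the four oblique windows share the remainder equally; neither assumption is stated in the text.

```latex
\[
\theta_{\text{oblique}}
  = \frac{360^\circ - 2(70^\circ) - 2(40^\circ)}{4}
  = \frac{140^\circ}{4}
  = 35^\circ .
\]
```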

After the background removal, the extracted target area is enlarged to a
fixed size. This fixed dimension should be slightly larger than the largest
aspect window, so that interpolation can be used for all the aspect windows,
and no crucial information in the extracted area will be lost by the
enlargement process. This enlargement enables a more uniform similarity measure
in terms of dimensionality, and hence legitimizes the comparison between
winning code vectors from different aspect windows during the LVQ
training and testing phases. Without this enlargement, smaller aspect windows
may have an advantage in finding a good match with an input image, even
if that input image indeed belongs to a bigger aspect window.
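
The clipping and enlargement steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the target is already centered in the chip, the window is a centered rectangle, and bilinear interpolation is used (the text says only "interpolation"). The function names `clip_center` and `enlarge_bilinear` are hypothetical.

```python
def clip_center(image, win_h, win_w):
    """Keep the centered win_h x win_w rectangle of a 2-D list of pixels."""
    rows, cols = len(image), len(image[0])
    top = (rows - win_h) // 2
    left = (cols - win_w) // 2
    return [row[left:left + win_w] for row in image[top:top + win_h]]


def enlarge_bilinear(region, out_h, out_w):
    """Resample a 2-D region to out_h x out_w (bilinear is an assumption)."""
    in_h, in_w = len(region), len(region[0])
    out = []
    for r in range(out_h):
        # Map the output row to a fractional source row; corners stay aligned.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            upper = region[y0][x0] * (1 - fx) + region[y0][x1] * fx
            lower = region[y1][x0] * (1 - fx) + region[y1][x1] * fx
            row.append(upper * (1 - fy) + lower * fy)
        out.append(row)
    return out
```

Because every aspect window is resampled to the same output size, the code vectors of all windows end up with the same dimension, which is what makes their distortions comparable.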
2.2  Wavelet  Decomposition


Wavelets  are  mathematical  functions  that  separate  data  into  several  different 
frequency  components,  and  then  represent  each  component  with  a  resolution 
matched  to  its  scale.  Compared  to  traditional  Fourier  methods,  wavelets 
are  more  suitable  in  analyzing  physical  situations  where  the  signal  contains 
discontinuities  and  sharp  spikes  [34].  The  biggest  difference  between  these
two  kinds  of  transforms  is  that  the  individual  wavelet  functions  are  localized 
in  space,  while  the  Fourier  sine  and  cosine  functions  are  not.  Because  of 
the  wavelets’  space-localization  property,  many  functions  are  sparse  when 
transformed  into  the  wavelet  domain.  Because  of  this  sparseness,  wavelets 
have  been  shown  to  be  very  useful  in  detecting  features  in  images  [35,36], 
image  compression  [37-39],  texture  discrimination  [40],  removal  of  noise  from 
time  series  [41],  and  so  forth. 

The quadrature mirror filter (QMF) was first introduced by Croisier et al [42]
as a tool that allows alias-free reconstruction of the signal in the absence
of quantization errors. More recently, Vetterli and Herley [43] described the
relationships among wavelets, filter banks, and multiresolution signal
processing in greater detail. We implemented the wavelet decomposition in this
report through QMF using the simplest Haar filter family [44]. Because of
the small size of the target chips in our experiments, and for the sake of the
efficiency of the ATR classifier, only the Haar two-tap even-length filters
were used. In spite of its simplicity and differential discontinuity, the Haar
basis can still perform reasonably well in an image compression task if it has
good impulse and step response properties, as in the case of two-tap filters
(Villasenor et al [45]). Villasenor et al also found that even-length filters
have significantly less shift variance than odd-length filters and possess
superior impulse response performance, so that even-length filters can perform
much better in preserving the location, shape, and intensity of impulses
(sharp edges). These properties make the Haar two-tap filters well suited to
a target recognition task.

Based on the wavelet decomposition method described above, the enlarged
extracted area of each aspect window is decomposed into four subbands of
equal size. An example of the decomposition is shown in figure 6. We
normalize each subband by subtracting the mean of that subband from each
pixel and then dividing each pixel by the standard deviation of that subband.
This way, unwanted variations among similar samples of a particular
subband, such as differences in brightness and contrast, can be reduced. This
normalization step is critical to securing consistent input information for
the VQ.

Figure 6. Wavelet decomposition of a truck (left) into four subbands using Haar two-tap
filters (right).
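
A one-level decomposition and the per-subband normalization can be sketched in Python as follows. This is a minimal illustration, assuming an orthonormal two-tap Haar pair, an image with even height and width, and one particular naming convention for the HL and LH bands (conventions differ between authors). The names `haar_decompose` and `normalize` are hypothetical, and the guard for a constant subband is an added assumption.

```python
import math


def haar_decompose(image):
    """One-level 2-D Haar split of an even-sized image into LL, HL, LH, HH."""
    ll, hl, lh, hh = [], [], [], []
    for r in range(0, len(image), 2):
        ll_row, hl_row, lh_row, hh_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top pixel pair
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom pixel pair
            ll_row.append((a + b + d + e) / 2.0)  # low-pass in both directions
            hl_row.append((a - b + d - e) / 2.0)  # horizontal difference (naming assumed)
            lh_row.append((a + b - d - e) / 2.0)  # vertical difference (naming assumed)
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        hl.append(hl_row)
        lh.append(lh_row)
        hh.append(hh_row)
    return ll, hl, lh, hh


def normalize(band):
    """Subtract the subband mean and divide by the subband standard deviation."""
    values = [v for row in band for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        # Constant subband: return zeros rather than divide by zero (assumption).
        return [[0.0 for _ in row] for row in band]
    return [[(v - mean) / std for v in row] for row in band]
```

With the orthonormal scaling used here, the four subbands together carry the same energy as the input block, and each subband has one quarter of its pixels.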


2.3  Vector  Quantizers 

The Voronoi or nearest neighbor VQ belongs to a special class of VQ whose
partition is solely determined by a codebook and a distortion measure [46].
If the MSE distortion measure $d(\mathbf{X}, \mathbf{Y})$ is defined as the mean squared
Euclidean distance between two vectors $\mathbf{X}$ and $\mathbf{Y}$ of dimension $k$, such that

$$ d(\mathbf{X}, \mathbf{Y}) = \frac{1}{k} \|\mathbf{X} - \mathbf{Y}\|^2 = \frac{1}{k} \sum_{i=1}^{k} (x_i - y_i)^2 , $$

then a partition cell $R_i$ of a Voronoi VQ of $L$ levels is defined as

$$ R_i = \{ \mathbf{X} : d(\mathbf{X}, \mathbf{Y}_i) < d(\mathbf{X}, \mathbf{Y}_j) \mid j = 1, 2, \ldots, L;\ j \ne i \} , $$

where $R_i$ consists of all the points $\mathbf{X}$ in $\mathbb{R}^k$ space that have the least
distortion when reproduced with the code vector $\mathbf{Y}_i$. All partition cells are formed
by the intersections of half-spaces that are determined by the code vectors
explicitly.
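
The distortion measure and the nearest neighbor rule translate directly into a short Python sketch. Vectors are represented as plain lists, and the function names `mse` and `nearest_code_vector` are hypothetical; ties are resolved in favor of the lowest index, which is an implementation choice rather than something the text specifies.

```python
def mse(x, y):
    """Mean squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def nearest_code_vector(x, codebook):
    """Return (index, distortion) of the code vector that best reproduces x."""
    best = min(range(len(codebook)), key=lambda i: mse(x, codebook[i]))
    return best, mse(x, codebook[best])
```

For instance, `nearest_code_vector([1, 1], [[0, 0], [1, 2], [5, 5]])` selects index 1, because its distortion (0.5) is smaller than that of the other two code vectors (1.0 and 16.0).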

For each aspect window, four independent codebooks are constructed, one
for each subband. The number of levels $L$ for each subband codebook is
determined by the variability of the information within that subband. The
total MSE for an aspect window is a function (such as a simple summation)
of the best MSEs produced by the code vectors from the four codebooks
associated with that aspect window. Since the extracted target area is
enlarged to a fixed dimension, the code vector size $k$ is the same for all the
codebooks in all the aspect windows. Therefore, the total MSE measure can
legitimately be used for distortion comparisons between the codebooks of
different aspect windows.
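
The total-MSE comparison across aspect windows can be sketched as follows, using the simple-summation option mentioned above. The dictionary layout (one entry per class and aspect window, each holding its own list of subband vectors and its own list of codebooks) and the names `total_mse` and `classify` are assumptions made for illustration. Each window has its own subbands because each window clips a different region of the input before enlargement.

```python
def total_mse(subbands, codebooks):
    """Sum, over subbands, of the smallest MSE found in the matching codebook."""
    total = 0.0
    for x, codebook in zip(subbands, codebooks):
        total += min(
            sum((a - b) ** 2 for a, b in zip(x, y)) / len(x) for y in codebook
        )
    return total


def classify(subbands_by_window, codebooks_by_window):
    """Return the (class, window) key with the smallest total MSE, plus all scores."""
    scores = {
        key: total_mse(subbands_by_window[key], codebooks_by_window[key])
        for key in codebooks_by_window
    }
    best = min(scores, key=scores.get)
    return best, scores
```

The class part of the winning key is then taken as the class of the input image.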

The resulting VQ codebooks are indeed a powerful form of constrained VQ,
which is usually referred to as a product code VQ [46]. To visualize the idea,
assume an enlarged extraction $\mathbf{W}$ of dimension $m > 1$. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_b$
be a set of subbands that are functions of $\mathbf{W}$ and jointly determine $\mathbf{W}$.
There are functions $f_i$ for $i = 1, 2, \ldots, b$ such that $\mathbf{W}$ can be decomposed
into subband $\mathbf{X}_i$ according to $\mathbf{X}_i = f_i(\mathbf{W})$ for $i = 1, 2, \ldots, b$. Each $\mathbf{X}_i$ is
sometimes called a feature vector because it represents some characteristics
of $\mathbf{W}$ while still partially describing $\mathbf{W}$. Each feature vector assumes values
in a more compact region of $m$-dimensional space or has a lower
dimensionality, and hence should be easier to quantize than $\mathbf{W}$. For each $i$, let $C_i$ be
a codebook with $N_i$ code vectors, which contains the reproduction values
for $\mathbf{X}_i$. A product code VQ is then a VQ that finds the indices of the closest
centroids $\hat{\mathbf{X}}_i \in C_i$, $i = 1, 2, \ldots, b$, in order to completely represent $\mathbf{W}$. The
resulting set of indices is called a product code, because the requirement
that $\hat{\mathbf{X}}_i \in C_i$ for all $i$ is equivalent to saying that the overall reproduced
vector $(\hat{\mathbf{X}}_1, \ldots, \hat{\mathbf{X}}_b)$ is in the space defined by the Cartesian product $C$ of
the $b$ codebooks, that is, $(\hat{\mathbf{X}}_1, \ldots, \hat{\mathbf{X}}_b) \in C = C_1 \times C_2 \times \cdots \times C_b$. The set
of all possible reproduction vectors for $\mathbf{W}$ is then defined by the set of all
possible combinations taking any code vector from each of the $b$ codebooks,
that is, $N = \prod_{i=1}^{b} N_i$ possible reproduction vectors in all. $\mathbf{W}$ is indeed
encoded with this huge set of reproduction vectors. However, this codebook
is not optimal in general, because its code vectors are constrained by the
structure of the reproduced overall vector $(\hat{\mathbf{X}}_1, \ldots, \hat{\mathbf{X}}_b)$ and the individual
component codebooks $C_i$.
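
To give a sense of scale, consider an illustrative count. The sizes are assumed for the example and are not taken from the experiments: $b = 4$ subband codebooks with $N_i = 10$ code vectors each.

```latex
\[
N = \prod_{i=1}^{4} N_i = 10^4 = 10\,000
\quad \text{reproduction vectors, while only} \quad
\sum_{i=1}^{4} N_i = 40
\quad \text{code vectors are stored and searched.}
\]
```

The product structure therefore buys a large effective codebook at a small storage and search cost, at the price of the constraint on the reproduction vectors noted above.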

For these Voronoi VQs, the training is performed independently for each
codebook. Only the training data that belong to a given target class and
aspect window are used to train the codebooks that are dedicated to the
subbands in that target class and aspect window. To construct and initialize
the codebooks, we cluster the training data for a given codebook according
to a predefined cluster boundary. Clusters or code vectors with a very low
population are discarded, so that the codebook is sufficiently small but still
contains the more popular feature templates. Nonetheless, we keep at least
10 code vectors for each codebook in order to maintain a minimal
distinguishing capability and to compensate for any overly small cluster boundary.
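
One way to picture this initialization is the following Python sketch. The text does not specify the clustering rule, so everything procedural here is an assumption: a sequential pass in which a sample joins the first cluster whose centroid lies within an MSE `boundary`, a `min_population` cutoff for discarding sparse clusters, and a fallback that keeps the `min_size` most populous clusters. The function name `init_codebook` and its parameters are hypothetical.

```python
def init_codebook(data, boundary, min_population=2, min_size=10):
    """Build initial code vectors by sequential clustering (illustrative only)."""
    clusters = []  # each cluster keeps a running sum and a member count
    for x in data:
        for cluster in clusters:
            centroid = [s / cluster["count"] for s in cluster["sum"]]
            distortion = sum((a - b) ** 2 for a, b in zip(x, centroid)) / len(x)
            if distortion <= boundary:
                cluster["sum"] = [s + a for s, a in zip(cluster["sum"], x)]
                cluster["count"] += 1
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append({"sum": list(x), "count": 1})
    # Most populous clusters first, so the fallback keeps the popular templates.
    clusters.sort(key=lambda c: c["count"], reverse=True)
    kept = [c for c in clusters if c["count"] >= min_population]
    if len(kept) < min_size:
        kept = clusters[:min_size]
    return [[s / c["count"] for s in c["sum"]] for c in kept]
```

The default `min_size=10` mirrors the stated floor of 10 code vectors per codebook; the other defaults are arbitrary.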

The K-means algorithm [47] is used to train each Voronoi VQ independently
by updating each code vector in each codebook with the average of all the
data that are closest, in terms of Euclidean distance, to that code vector.
The goal of this learning process is to capture, by minimizing the average
distortion, the contextual similarities among the samples that belong to a
particular subband of the intended target and aspect window. The training
stops when no more changes have been made to any of the code vectors.
After the K-means training, the code vectors of a codebook will represent
the most general structures extracted for all the input targets that belong to
that particular subband of a given target class and aspect window. Figure 7
shows all the code vectors of the four codebooks that belong to the left view
of a truck, after the K-means training process. The edges of the target are
relatively blurred in these code vectors, because similar input images do not
always have the same sharp edges, and those edges rarely occur at the same
location.
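
The K-means update described above can be sketched in Python as follows. The sketch follows the stated rule (replace each code vector with the mean of the data closest to it, and stop when nothing changes); the `max_iters` safety cap and the choice to leave a code vector untouched when no data fall in its cell are added assumptions, and the name `kmeans_train` is hypothetical.

```python
def kmeans_train(data, codebook, max_iters=100):
    """Refine code vectors by repeated nearest-neighbor assignment and averaging."""
    codebook = [list(c) for c in codebook]  # work on a copy
    for _ in range(max_iters):
        members = [[] for _ in codebook]
        for x in data:
            nearest = min(
                range(len(codebook)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])),
            )
            members[nearest].append(x)
        changed = False
        for i, group in enumerate(members):
            if not group:
                continue  # empty cell: keep the old code vector (assumption)
            centroid = [sum(column) / len(group) for column in zip(*group)]
            if centroid != codebook[i]:
                codebook[i] = centroid
                changed = True
        if not changed:
            break  # no code vector moved, so training has converged
    return codebook
```

Averaging is also what produces the blurred edges noted above: edges that fall at slightly different positions in the member images are smeared when the members are averaged into one template.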

Since the average distortion is minimized, a Voronoi VQ works fairly well in
data compression. However, this type of VQ does not perform well as a
classifier since different classes often overlap in the feature space. As a result, the
independently trained codebooks can have very similar code vectors, which
may lead to misclassification when ambiguous input images are encountered.
A better method is needed to cleverly alter the half-spaces of those
overlapping partitions that originated from different Voronoi codebooks, so that
the recognition performance can be improved.

Figure 7. Content of codebooks for left view of a truck after K-means training process: bands (a) LL, (b) HL, (c) LH,
and (d) HH. (L refers to low-pass decomposition and H to high-pass decomposition.)

2.4  Learning  Vector  Quantization 

One of the popular methods to globally modify the decision boundaries of a
Voronoi VQ is supervised LVQ. Many variants of LVQ have been proposed
and applied for classification purposes, as was shown by Kohonen [15,16].
The version of the LVQ algorithm proposed in this report is modified from
a variant called LVQ2.1, proposed by Kohonen [15]. In LVQ2.1, the updates
are performed on two code vectors, $\mathbf{Y}_i$ and $\mathbf{Y}_j$, that are nearest to a training
image block $\mathbf{X}$, provided that one of these code vectors belongs to the correct
class, while the other belongs to a wrong class, and $\mathbf{X}$ falls within an update
zone defined around the mid-plane of $\mathbf{Y}_i$ and $\mathbf{Y}_j$. Assuming that $d_i$ and $d_j$
are the Euclidean distances of $\mathbf{X}$ from $\mathbf{Y}_i$ and $\mathbf{Y}_j$, respectively, this update
zone is defined as the region where

$$ \min\left( \frac{d_i}{d_j}, \frac{d_j}{d_i} \right) > T , $$

and $T$ is a threshold whose value is usually chosen between 0.5 and 0.7,
depending on the application. This update zone restricts the updates to
only those highly ambiguous pairs of code vectors that are around the class
decision boundaries. Assuming $\mathbf{X}$ belongs to the same class as $\mathbf{Y}_i$ and not
to that of $\mathbf{Y}_j$, the two code vectors are updated as follows:

$$ \mathbf{Y}_i(t+1) = \mathbf{Y}_i(t) + \alpha(t) [\mathbf{X}(t) - \mathbf{Y}_i(t)] , $$

$$ \mathbf{Y}_j(t+1) = \mathbf{Y}_j(t) - \alpha(t) [\mathbf{X}(t) - \mathbf{Y}_j(t)] . $$

This iterative process will decrease the Euclidean distance for $\mathbf{Y}_i$ and
increase it for $\mathbf{Y}_j$. Here $\alpha$ is the learning rate, usually with an initial value
around 0.1, which may be decreased gradually in the course of training.
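
A single LVQ2.1 update can be sketched in Python as follows. The sketch assumes the standard form of Kohonen's window test, in which the smaller of the two distance ratios must exceed the threshold $T$; the function name `lvq21_step`, its parameters, and the default values `alpha=0.1` and `threshold=0.6` are illustrative choices within the ranges quoted above. It covers only the baseline LVQ2.1 rule, not the modified algorithm of this report.

```python
import math


def lvq21_step(x, label, vectors, labels, alpha=0.1, threshold=0.6):
    """Apply one LVQ2.1 update in place; return True if the pair was updated."""
    dist = [math.dist(x, y) for y in vectors]
    order = sorted(range(len(vectors)), key=lambda n: dist[n])
    i, j = order[0], order[1]  # the two nearest code vectors
    # Exactly one of the two must belong to the class of x.
    if (labels[i] == label) == (labels[j] == label):
        return False
    # min(d_i/d_j, d_j/d_i) equals smaller/larger because dist[i] <= dist[j].
    ratio = dist[i] / dist[j] if dist[j] > 0 else 1.0
    if ratio <= threshold:
        return False  # x is not inside the update zone around the mid-plane
    correct, wrong = (i, j) if labels[i] == label else (j, i)
    # Pull the correct-class vector toward x and push the wrong-class one away.
    vectors[correct] = [y + alpha * (v - y) for v, y in zip(x, vectors[correct])]
    vectors[wrong] = [y - alpha * (v - y) for v, y in zip(x, vectors[wrong])]
    return True
```

A sample lying almost midway between a correct and a wrong code vector triggers an update, whereas a sample sitting close to one of them does not, which is how the rule confines training to the ambiguous region near the decision boundary.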

We now explain the modified LVQ algorithm that is used in our target
classification technique. Suppose that we have a classification problem consisting
of $K$ target classes and $W$ aspect windows for each class. Let the enlarged
extracted target area be decomposed into $B$ subbands for each aspect
window, and say that there are $P$ patterns in the training set. Several variables
are defined and the modified LVQ algorithm is summarized as follows:

The variables $\{ \mathbf{X}_{pkwb}, C_{kwb}, L_{kwb} \mid p = 1, 2, \ldots, P;\ k = 1, 2, \ldots, K;\ w =
1, 2, \ldots, W;\ b = 1, 2, \ldots, B \}$ denote the subband $b$ of the aspect window $w$
of class $k$ for the training pattern $p$, the codebook dedicated to the subband
$b$ of aspect window $w$ of target class $k$, and the number of code vectors in
$C_{kwb}$, respectively. Also, the variables $\{ \mathbf{Y}_{kwbl}, \mathbf{A}_{kwbl}, U_{kwbl}, E_{kw} \mid k =
1, 2, \ldots, K;\ w = 1, 2, \ldots, W;\ b = 1, 2, \ldots, B;\ l = 1, \ldots, L_{kwb} \}$ denote the
$l$th code vector in the codebook $C_{kwb}$, the total error gradient accumulated
for updating $\mathbf{Y}_{kwbl}$, the frequency of such accumulations in an epoch, and
the total MSE produced by the aspect window $w$ of target class $k$ for a given
input pattern.

Step 1: Given the training input pattern p' that belongs to the window
w' of class k', compute the best total MSE obtained by matching all the
B subbands with the best-matching code vectors in their corresponding
codebooks:

$$E_{kw} = \sum_{b=1}^{B} \left( \min_{1 \le l \le L_{kwb}} \left\| X_{p'kwb} - Y_{kwbl} \right\|^2 \right)$$

for k = 1, 2, ..., K; w = 1, 2, ..., W. Functions other than a simple summation
can also be used in the calculation of $E_{kw}$, such as summing only the
best b < B MSEs. For this input pattern, find $E^*$, the minimum total MSE
obtained among all the aspect windows, and $E_t$, the total MSE produced
by the codebooks that belong to the window w' of class k':

$$E^* = \min_{\forall k, w} E_{kw} \quad \text{and} \quad E_t = E_{k'w'}.$$

Now compute the distance for the updating neighborhood, D, which is a
function of $E^*$, such as

$$D = E^* \times \text{constant}.$$


Step 2: For each $E_{kw}$, if $E_t < D$ and $E_{kw} < D$ while both $E_t$ and $E_{kw}$ are
among the four smallest total MSEs for this pattern, then for b = 1, 2, ..., B,
calculate

$$A_{kwbl'} = A_{kwbl'} + \alpha\,(X_{p'kwb} - Y_{kwbl'}) \quad \text{if } k = k' \text{ and } w = w',$$

$$A_{kwbl'} = A_{kwbl'} - \alpha\,(X_{p'kwb} - Y_{kwbl'}) \quad \text{if } k \neq k',$$

$$U_{kwbl'} = U_{kwbl'} + 1 \quad \text{if } (k = k' \text{ and } w = w') \text{ or } k \neq k',$$

where l' is the index of the winning code vector in codebook $C_{kwb}$ in response
to the input image data $X_{p'kwb}$ and $\alpha$ is the learning rate of the LVQ.


Step 3: Repeat steps 1 and 2 for all P training input patterns. Update all
the codebooks by

$$Y_{kwbl} = Y_{kwbl} + \frac{A_{kwbl}}{U_{kwbl}} \quad \text{if } U_{kwbl} \neq 0$$

for k = 1, 2, ..., K; w = 1, 2, ..., W; b = 1, 2, ..., B; l = 1, 2, ..., $L_{kwb}$. This
forms an LVQ epoch.

Step 4: Clear A and U. Repeat steps 1 through 3 until the codebooks
converge (that is, no more unintended code vectors satisfy the conditions in
step 2) or a predefined number of epochs has been reached.
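
To make the four steps concrete, here is a minimal Python sketch of one training epoch. It is an illustration under stated assumptions rather than the implementation used in this report: the function names, the dictionary layout keyed by (class, window) pairs, the `factor` value standing in for the constant in D, and the use of a per-pixel mean squared error are all choices made here.

```python
def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def match(subbands, books):
    """Return (total MSE, winning indices) for one processing path."""
    total, winners = 0.0, []
    for x, book in zip(subbands, books):
        errs = [mse(x, y) for y in book]
        best = min(range(len(book)), key=errs.__getitem__)
        total += errs[best]
        winners.append(best)
    return total, winners


def lvq_epoch(patterns, codebooks, alpha=0.1, factor=1.5, top=4):
    """Run one epoch (steps 1 to 3); return the number of wrong-class updates.

    patterns  -- list of (true_path, {path: [subband vectors]}) pairs,
                 where a path is a (class k, window w) tuple
    codebooks -- {path: [codebook per subband]}, modified in place
    factor    -- hypothetical value of the constant in D = E* x constant
    """
    A, U, rivals = {}, {}, 0
    for true_path, subbands in patterns:
        # Step 1: total MSE of every path, then E*, E_t, and D.
        results = {p: match(subbands[p], codebooks[p]) for p in codebooks}
        E = {p: r[0] for p, r in results.items()}
        D = min(E.values()) * factor
        smallest = sorted(E, key=E.get)[:top]
        if E[true_path] >= D or true_path not in smallest:
            continue
        # Step 2: accumulate gradients for winners inside the neighborhood.
        for p in smallest:
            if E[p] >= D:
                continue
            if p == true_path:
                sign = 1.0
            elif p[0] != true_path[0]:
                sign = -1.0
                rivals += 1
            else:
                continue  # correct class, wrong window: no action
            for b, l in enumerate(results[p][1]):
                x, y = subbands[p][b], codebooks[p][b][l]
                acc = A.setdefault((p, b, l), [0.0] * len(x))
                for i, (xv, yv) in enumerate(zip(x, y)):
                    acc[i] += sign * alpha * (xv - yv)
                U[(p, b, l)] = U.get((p, b, l), 0) + 1
    # Step 3: apply the averaged gradients to every touched code vector.
    for (p, b, l), count in U.items():
        y = codebooks[p][b][l]
        codebooks[p][b][l] = [yv + av / count for yv, av in zip(y, A[(p, b, l)])]
    return rivals


# Hypothetical example: two classes, one window each, one subband each.
codebooks = {(0, 0): [[[0.0, 0.0]]], (1, 0): [[[2.0, 2.0]]]}
x = [1.0, 1.0]
patterns = [((0, 0), {(0, 0): [x], (1, 0): [x]})]
rivals = lvq_epoch(patterns, codebooks)
```

Because the accumulators are created inside `lvq_epoch`, step 4 reduces to calling the function again until it returns 0 (no wrong-class code vector fell inside the neighborhood) or a preset epoch limit is reached.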

The  neighborhood  defined  in  step  1  for  selecting  the  code  vectors  to  be 
updated  is  illustrated  in  figure  8  for  a  two-dimensional  vector  space.  The  dot 
in  the  middle  of  this  figure,  labeled  X,  represents  an  input  training  pattern 
under  consideration.  The  single  triangle  represents  the  group  of  codebooks 
that  are  dedicated  to  the  correct  aspect  window  and  target  type  of  this 
sample, with a total MSE of $E_t$ away from X. The squares represent the
groups  of  codebooks  that  are  dedicated  to  other  target  types.  One  of  these 
groups  lies  closest  to  X,  scoring  a  minimum  total  MSE  of  E* .  Several  other 
groups  are  also  very  close  to  X,  and  they  could  find  a  better  match  to  X  than 
the  intended  group.  The  goal  of  LVQ  is  to  pull  the  triangle  closer  to  X,  while 
pushing  the  squares  farther  away  from  X.  As  we  can  see  in  step  2,  no  action 
is  taken  for  the  groups  of  codebooks  that  belong  to  the  correct  target  type 
but are associated with wrong aspect windows (i.e., k = k' and w ≠ w'). The
reason  is  that  the  wrong  windows  of  the  correct  target  often  share  certain 
characteristics  of  the  correct  window.  Therefore,  updating  those  codebooks 
in  either  direction  may  be  harmful  to  their  intended  functionalities.  On  the 
other  hand,  there  is  no  harm  in  having  a  wrong  aspect  window  of  the  correct 
target  type  be  closest  to  X,  as  our  goal  is  to  detect  the  correct  target  type 
of  X,  not  its  aspect. 

In essence, this modified LVQ algorithm computes the appropriate decision
boundary adjustments by clustering the errors caused by the adjacent code
vectors. This method works well if some meaningful features have already
been formed during a previous learning process, such as the K-means training
(as in this case). Otherwise, the updating neighborhood defined in step
1 may not be able to effectively avoid the adverse influence of the outliers
that are embedded in the training set. Moderately large $\alpha$ values can be used
in this method without causing stability problems, because of the averaged
gradient error updates in step 3. Hence a quick and stable convergence can
be achieved. As a result of boundary adjustment, an intended group of codebooks
would be more likely to yield the lowest total MSE for its intended
input image. Figure 9 shows all the codebooks that represent the left view
of a truck after 29 epochs of the LVQ training. Compared to figure 7, the features
in figure 9 acquired stronger contrast and sharper edges. Apparently,
the LVQ algorithm enhanced the discriminability at those regions that are
critical for classification purposes. The LVQ training process also rearranged
the order of code vectors based on their usage frequency and removed those
code vectors with a very low usage frequency.

Figure 8. Neighborhood for updating procedure described for a two-dimensional vector
space. Dot in center represents a training pattern. Triangle and squares represent groups
of codebooks that belong to correct and wrong target class types for this input training
pattern, respectively.

Figure 9. Content of codebooks for left view of a truck after 29 epochs of LVQ training: bands (a) LL, (b) HL, (c) LH,
and (d) HH.

2.5  Processing  Path  Selector 

Instead of passing an input image through all the processing paths available,
as shown in figure 3, it is possible to use only a subset of these paths
to process each input image without incurring a significant degradation in
the target recognition rate. For instance, if we search only the 40 most likely
paths out of 80 processing paths available in a classifier, the total computational
cost could be reduced by nearly 50 percent. However, the degradation
in the recognition rate may be less than 1 percent. In this way, the efficiency
and response time of the proposed ATR classifier can be greatly improved.
To realize this efficient scheme, we add a processing path selector at the
input stage of the ATR classifier, as shown in figure 10. To avoid an excessive
computational overhead, we want the path selector to be very fast
and simple. Nonetheless, it must also be effective in capturing the correct
path of the input image, even when only a small number of paths should be
activated at any moment.

To  build  the  path  selector,  we  create  a  most  representative  image  for  each 
aspect  window  by  taking  the  mean  of  all  training  images  that  pertain  to 
that  processing  path.  In  other  words,  a  Voronoi  codebook  with  a  single  code 
vector  is  constructed  for  each  processing  path.  As  in  the  VQ  stage,  the 
algorithm  removes  the  mean  and  variance  of  each  input  image  before  it  is 
used  in  the  path  selector,  so  that  the  unwanted  variations  in  brightness  and 
contrast  can  be  reduced.  If  we  want  to  find  the  n  most  likely  paths  for  an 
input  image,  we  first  compute  and  sort  the  MSEs  between  the  input  image 
and  all  the  representative  images.  Then  we  select  all  the  processing  paths 
whose  representative  images  have  accounted  for  the  n  lowest  MSEs.  Usually 
the  input  image  can  find  a  good  match  with  the  representative  image  of  its 
correct  path  and  hence  produces  a  relatively  low  MSE. 
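
The selector described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's code: the function names are hypothetical, images are represented as flat pixel lists of equal length, and "removing the mean and variance" is interpreted here as normalizing each image to zero mean and unit variance.

```python
def normalize(img):
    """Remove the mean and scale to unit variance (flat pixel list)."""
    n = len(img)
    mean = sum(img) / n
    var = sum((p - mean) ** 2 for p in img) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 for a perfectly flat image
    return [(p - mean) / std for p in img]


def build_representatives(training):
    """Mean of the normalized training images for each processing path.

    training -- dict mapping path -> list of images (flat pixel lists)
    """
    reps = {}
    for path, images in training.items():
        normed = [normalize(im) for im in images]
        reps[path] = [sum(col) / len(normed) for col in zip(*normed)]
    return reps


def select_paths(img, reps, n):
    """Return the n paths whose representative images give the lowest MSE."""
    x = normalize(img)

    def mse(path):
        return sum((a - b) ** 2 for a, b in zip(x, reps[path])) / len(x)

    return sorted(reps, key=mse)[:n]


# Hypothetical four-pixel images for two processing paths.
training = {
    ("truck", "head"): [[0, 0, 1, 1], [0, 0, 2, 2]],
    ("truck", "side"): [[1, 0, 1, 0]],
}
reps = build_representatives(training)
best = select_paths([0, 0, 3, 3], reps, 1)
```

The normalization makes the brighter, higher-contrast query image match the "head" representative exactly, which is the effect the mean and variance removal is meant to achieve.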

Since our path selector is a shape classifier in general, we can use the proposed
LVQ algorithm to enhance its classification performance. Because
there is only one code vector per processing path and no wavelet decomposition
is necessary, the LVQ training in this case will be much simpler and
quicker than the one in the previous subsection. On the other hand, it might
sometimes be difficult to produce a correct, clear-cut matching between a
single representative image for a given aspect window and the input images
corresponding to that aspect window. Therefore, a larger pool of candidates
should be considered for updates in step 2 of our LVQ algorithm. Instead
of updating only the best four candidates, the number of top candidates
in the updating pool should be increased to, say, half the processing paths
available in the classifier.

Figure 10. Proposed ATR classifier with a processing path selector.

3. Experimental Results


To demonstrate the performance of the proposed ATR classifier, we implemented
a 10-class problem. FLIR images of 10 targets were obtained at every
5° on a horizontal plane and scaled to a 2-km viewing range. The input images
are 10-bit grey-scale images of size 40×75 pixels, which are assumed to
have been extracted from the whole scene by an automatic cueing algorithm.
For the sake of simplicity, only four aspect windows per target class (head,
tail, and two sides) were created in the following experiments unless stated
otherwise. The size of these aspect windows ranges from 18×29 to 29×65.
The training set contains a total of 13,860 images, with 874 to 1468 images
per target class. These images were taken with targets in the open and they
make up the “SIG” database. On the other hand, the test set consists of
3456 images from a database called “ROI”; this set has only 5 of the 10
target classes, and there are 577 to 798 images for each of these five target
classes. The ROI data were taken under less favorable conditions, such as
with targets in and around clutter, with different backgrounds, and under
various weather conditions; hence, these data are very challenging. Typical
examples of the SIG and ROI images are shown in figure 11. These images
are fed directly to the classifier without any other preprocessing or filtering.


[Figure 11 panel labels. Top row: HMMWV, BMP, T72, M35, ZSU23. Center row: 2S1, M60, M113, M3, M1. Bottom row: HMMWV, BMP, T72, M35, M60.]

Figure 11. Examples of target types: 10 types taken from SIG database (top and center
rows); these target chips are relatively easy to recognize. Last row shows five target types
taken from ROI database; these images are highly cluttered and very difficult to recognize.

3.1 Proposed Method and Variants


The algorithm first decomposes each enlarged extracted area into four subbands
using the Haar two-tap filter. A dedicated codebook of variable size
is constructed for each of these subbands for a given aspect window. The
Voronoi quantizers converge after 22 epochs of K-means training. The chance
that the correct aspect window of the correct target gives the lowest total
MSE (window recognition rate) is 96.05 and 63.14 percent for the SIG and
ROI data, respectively. On the other hand, the target recognition rate, which
is the chance of correctly identifying the target class regardless of its aspect
window, is 98.11 and 69.68 percent for the SIG and ROI data, respectively.
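
As an illustration of the first stage, the following Python sketch performs a one-level two-dimensional Haar decomposition into four subbands. The filter normalization (averaging, that is, dividing by 2 along each axis) and the subband naming (first letter for the horizontal filter, second for the vertical) are assumptions made here; the function name `haar_2d` is hypothetical.

```python
def haar_2d(img):
    """One-level 2-D Haar decomposition of an image with even dimensions.

    img -- list of rows (lists of numbers)
    Returns (LL, HL, LH, HH), each half the height and width of img.
    """
    def split_rows(rows):
        # Two-tap Haar analysis along each row: pairwise average and difference.
        low = [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in rows]
        high = [[(r[i] - r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in rows]
        return low, high

    def transpose(m):
        return [list(c) for c in zip(*m)]

    low, high = split_rows(img)  # horizontal filtering
    # Vertical filtering is done by transposing, filtering rows, transposing back.
    ll, lh = (transpose(m) for m in split_rows(transpose(low)))
    hl, hh = (transpose(m) for m in split_rows(transpose(high)))
    return ll, hl, lh, hh


# Hypothetical 2x2 image.
ll, hl, lh, hh = haar_2d([[4, 2], [2, 0]])
```

Each subband has one quarter of the pixels of the input, so a separate codebook per subband works on much shorter vectors than a codebook for the whole enlarged area.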

In order to increase the discriminatory power of the classifier, we trained
these Voronoi VQs further for 35 epochs with the modified LVQ algorithm
proposed in this report. The $E_{kw}$ in step 1 of the proposed LVQ algorithm
was computed as the sum of the top three MSEs for each aspect window, so
that a less reliable MSE (usually the one associated with the HH band) was
ignored. In order to differentiate the usefulness of information associated
with each subband, the algorithm weights the MSE produced by each subband
appropriately before the top-three selection process above. The best
target recognition rates achieved were 99.72 and 75.12 percent for the SIG
and ROI data, respectively. The target recognition performance for each
epoch of LVQ training is shown in figure 12. This figure shows that the performance
of the training set saturated around 99.75 percent after 30 epochs
of training, and that of the testing set deteriorated gradually after reaching
its peak at the 29th epoch.
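
The weighted top-three rule can be sketched as follows. The weight values are not given in this section, so the ones below are purely hypothetical, as is the function name; "top three" is taken to mean the three lowest weighted MSEs.

```python
def weighted_top_mse(subband_mses, weights, keep=3):
    """Weight each subband MSE, then sum the `keep` smallest weighted values."""
    weighted = [w * e for w, e in zip(weights, subband_mses)]
    return sum(sorted(weighted)[:keep])


# Hypothetical MSEs for the LL, HL, LH, and HH subbands of one aspect window.
mses = [1.0, 2.0, 3.0, 10.0]
equal = weighted_top_mse(mses, [1.0, 1.0, 1.0, 1.0])
```

With equal weights, the large HH error is dropped and only the three better-matching subbands contribute to the total.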


Figure 12. Target recognition performance of LVQ over epochs.

To demonstrate the superiority of the proposed method, we trained and
tested three variants of this ATR classifier with the same data sets. In the
first variation, the extraction enlargement stage is omitted. Different aspect
windows assume different sizes for their subbands and code vectors. The total
MSE is computed based on the top three subband MSEs that are normalized
by the number of pixels in their corresponding subbands. We call this the
no-enlargement variant. In the second variation, the wavelet decomposition
stage is omitted. The whole enlarged extraction area is used to build a single
codebook for each aspect window. Hence the size of all code vectors is the
same as the size of the enlarged extraction. We refer to this procedure as the
one-band variant. Finally, the joint-band variant concatenates all four
subbands together and then builds a single codebook for these concatenated
bands; hence, the advantages of product code VQ no longer exist.

Table 1 shows the best performances of the K-means and the LVQ training
of these three variants, together with the proposed method. The proposed
method clearly outperformed all three variants in terms of recognition
performance and generalization capability. The performance of the
no-enlargement variant is the closest to the proposed method, trailing by just
about 1 percent in all categories. Because it omits the enlargement stage and
compares smaller code vectors, this variant is computationally
more efficient than the proposed method. Therefore, the no-enlargement
variant might be used when the efficiency of recognition is very critical. On
the other hand, the joint-band variant required almost the same amount of
computational resources as the proposed method, but its performance is significantly
worse than that of the proposed method. Without the advantages
of product code VQ, the joint-band variant is less useful in any situation.
Finally, being deprived of the benefits of wavelet decomposition and product
code VQ, the one-band variant performed the worst in all categories.
Comparing the one-band variant to the joint-band variant, we can see that
the wavelet decomposition alone has accounted for a 7.15-percent difference
in the test performance.
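
The 7.15-percent figure appears to correspond to the ROI target recognition rates after LVQ training in table 1 (joint band versus one band); reading it that way, the difference is

$$69.68 - 62.53 = 7.15 \text{ percentage points}.$$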


Table 1. Best window and target recognition rates after K-means and LVQ training achieved by proposed method
and its three variants. (Best rates, shown in bold in the original, are all in the Proposed columns.)

                   Recognition rate for various methods (%)

                   Proposed          No enlargement    One band          Joint band
Training  Data     Window  Target    Window  Target    Window  Target    Window  Target
K-means   SIG      96.05   98.11     95.81   98.00     88.77   92.60     92.11   95.18
          ROI      63.14   69.68     61.43   69.16     40.80   47.92     55.06   61.55
LVQ       SIG      98.22   99.72     98.06   99.70     94.20   97.60     95.58   98.36
          ROI      70.43   75.12     68.61   73.90     56.28   62.53     63.89   69.68

3.2 Data and Aspect Windows

In  previous  experiments,  the  test  performance  on  the  ROI  database  was 
significantly  lower  than  on  the  SIG  database.  These  results  suggest  that 
the  ROI  database  may  contain  many  unique  characteristics  that  are  too 
difficult  to  capture  from  the  samples  of  the  SIG  data.  Hence  we  formed  a 
new  training  set  by  randomly  selecting  80  percent  of  the  images  from  both 
the  SIG  and  ROI  databases.  The  remaining  20  percent  of  images  in  both 
databases  constituted  a  new  testing  set.  We  retrained  the  proposed  ATR 
classifier  with  the  new  training  set. 
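
A random 80/20 split of this kind might look like the sketch below. Whether the split was drawn per database, per class, or over the pooled images is not stated, so splitting each database separately is an assumption made here, as are the function name, the seed, and the placeholder image records.

```python
import random


def split_80_20(images, seed=0):
    """Randomly split a list of images into 80% training and 20% testing."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 13,860 SIG and 3456 ROI images.
sig = [("SIG", i) for i in range(13860)]
roi = [("ROI", i) for i in range(3456)]
sig_train, sig_test = split_80_20(sig)
roi_train, roi_test = split_80_20(roi)
train = sig_train + roi_train
test = sig_test + roi_test
```

Splitting each database separately keeps the SIG-to-ROI proportion the same in the training and testing sets.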

As  shown  in  the  fourth  column  of  table  2,  the  target  recognition  rates  after 
the K-means training were 96.89 and 88.88 percent for the training and
testing sets, respectively. With LVQ training, the best target recognition
rates  went  up  to  99.47  and  93.53  percent  for  the  training  and  testing  sets, 
respectively.  Compared  to  the  best  results  in  table  1,  the  target  recognition 
rate  after  LVQ  training  improved  from  75.12  to  93.53  percent  with  the  new 
test  set,  while  the  recognition  rate  on  the  training  set  remained  almost  the 
same.  Of  the  ROI  images  in  the  new  test  set,  91.03  percent  were  correctly 
recognized.  Therefore,  after  learning  some  characteristics  of  the  ROI  data, 
the  classifier  can  now  perform  almost  as  accurately  with  the  ROI  as  with 
the  SIG  test  images. 

In all the experiments discussed so far, four aspect windows were created for
each target type. We also investigated the effect of increasing the number of
aspect windows on the recognition performance of the classifier. The classifier
was reconfigured so that eight aspect windows were created for each
target class, as illustrated in figure 4. The new data sets just described were
used to train and test this new configuration. After the K-means training,
the target recognition rates were 96.84 and 88.45 percent for the training
and testing sets, respectively. After the LVQ training, the corresponding
rates were raised to 99.28 and 93.01 percent, respectively. Compared with
the best target recognition rates in the four-window configuration, there was
about 0.19 and 0.52 percent degradation in the training and testing sets,
respectively. The number of processing paths was doubled from 40 to 80 in
this case, but the average codebook size was nearly halved, from 30.4 to 16.6
code vectors.

Table 2. Performance of proposed ATR classifier configured with four and eight aspect windows, respectively. New
training and testing sets that consist of both SIG and ROI target chips were used. (Best results, shown in bold in
the original, are all in the four-window columns.)

                   Performance with different number of aspect windows (%)

                   Four aspect windows    Eight aspect windows
Training  Data     Window  Target         Window  Target
K-means   Train    94.88   96.89          93.54   96.84
          Test     82.96   88.88          78.52   88.45
LVQ       Train    98.12   99.47          96.69   99.28
          Test     87.58   93.53          82.41   93.01

Therefore, adding oblique windows did not bring the expected improvement
in performance, while the amount of computation was increased by about
9 percent, based on the total number of code vectors processed per input
image. One possible reason for the slightly poorer performance is the reduction
of the training data per codebook during the formation of Voronoi
quantizers. In addition, the oblique windows, having the smallest range of
aspects (a range of 25°), may not have captured features embedded in their
inputs that were sufficiently distinctive for the input images to be correctly
identified.
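
The 9-percent figure can be checked from the path counts and average codebook sizes quoted above, under the assumption (made here) that every processing path carries the same number of subband codebooks, so that the cost scales with the number of paths times the average codebook size:

$$\frac{80 \times 16.6}{40 \times 30.4} = \frac{1328}{1216} \approx 1.092,$$

that is, roughly 9 percent more code vectors per input image in the eight-window configuration.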

3.3  Processing  Path  Selector 

We tested the processing path selector with both the four- and eight-window
configurations. In both cases, the best codebook array obtained after the
LVQ training was used in the VQ stage. We set the path selector so that
the top 1/4, 1/2, 3/4, or all of the available processing paths were used for
a given input image. The target recognition rates obtained with different
sets of processing paths activated are given in table 3 for both the four- and
eight-window configurations.

With only a subset of processing paths activated, we can see that the degradation
in performance is indeed very small. For instance, with only 20 out
of the 80 paths activated in the eight-window configuration, the test performance
decreased by merely 3.35 percent. This degradation was further reduced
to 0.78 percent when 40 out of the 80 paths were used. It is interesting
to observe that with half or less of the processing paths activated, the
eight-window configuration outperformed the four-window configuration. Therefore,
the eight-window configuration could be a better alternative than the
four-window configuration in situations where operational efficiency is a real
concern.
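
The degradation figures quoted above follow directly from the test rows of table 3. The short Python sketch below recomputes them; the dictionary layout and the function name are illustrative choices, while the rates themselves are the table values.

```python
# Test-set target recognition rates (%) from table 3, keyed by the
# fraction of processing paths activated by the path selector.
test_rates = {
    "four windows (40 paths)": {0.25: 86.89, 0.5: 91.94, 0.75: 92.95, 1.0: 93.53},
    "eight windows (80 paths)": {0.25: 89.66, 0.5: 92.23, 0.75: 92.93, 1.0: 93.01},
}


def degradation(rates):
    """Drop in recognition rate relative to using all paths (percentage points)."""
    full = rates[1.0]
    return {frac: round(full - r, 2) for frac, r in rates.items() if frac < 1.0}


for config, rates in test_rates.items():
    print(config, degradation(rates))
```

For the eight-window configuration this gives drops of 3.35 and 0.78 points at one quarter and one half of the paths, matching the text, and it also shows that the four-window configuration loses more (6.64 and 1.59 points) at the same fractions.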


Table 3. Target recognition rate achieved when different sets of processing paths are chosen by a path selector.

                                   Performance with different sets of paths used (%)
Configuration              Data    Top 1/4   Top 1/2   Top 3/4   All paths
Four windows (40 paths)    Train   91.45     96.93     98.87     99.47
                           Test    86.89     91.94     92.95     93.53
Eight windows (80 paths)   Train   94.72     98.09     98.99     99.28
                           Test    89.66     92.23     92.93     93.01

4. Comparison with Other Results


Considering  the  amount  of  training  and  test  data  used  in  our  recognition 
problem,  the  results  obtained  in  our  experiments  are  quite  good.  Since  there 
is  neither  a  standard  set  of  ATR  system  requirements  nor  a  common  FLIR 
testing  set  for  ATR  problems,  it  is  relatively  difficult  to  compare  our  results 
with  most  existing  ATR  classifiers.  For  most  ATR  classifiers  described  in 
the  literature,  the  authors  generally  considered  only  a  few  target  classes 
(five  or  fewer)  and  used  very  small  training  and  testing  sets  (fewer  than  100 
samples). 

Fortunately, we can make a suitable comparison with the results published
by Mirelli and Rizvi [48], because we and they used identical SIG and
ROI databases. (No other published results use the same databases.) The
schematic diagram of Mirelli and Rizvi’s complete ATR classification system
is shown as figure 13. They used a group classifier to determine the general
shape of an incoming target, followed by 14 similar multilayer convolution
neural networks (MLCNNs) that were optimized to recognize targets of particular
size and shape. Each MLCNN consists of four layers that are made
up of 25 to 50 convolutional kernels, so that each is a rather complex module
by itself. The outputs of these MLCNNs were combined by another MLP to
produce the final recognition score. In a 10-class problem, the best recognition
result Mirelli and Rizvi obtained for the complete ROI testing set was
73.41 percent, which is slightly lower than the 75.12 percent that our classifier
obtained under the same conditions. As they reported a 2.2-percent
improvement brought by the final MLP alone, we expect that our result can
also be further improved by the addition of an appropriate postprocessor at
the end of our processing paths.

Figure 13. ATR classifier proposed by Mirelli and Rizvi [48].

5. Conclusions


A  sad  lesson  learned  by  our  armed  forces  during  Operation  Desert  Storm 
is  the  difficulty  of  distinguishing  between  friendly  and  hostile  vehicles  or 
equipment  under  poor  visibility  conditions.  Consequently,  more  than  half 
of  the  American  casualties  were  inflicted  by  so-called  “friendly  fire.”  To 
overcome  similar  problems  in  the  future,  we  need  ATR  systems  that  are 
effective  and  efficient  in  distinguishing  enemy  and  friendly  targets  under 
adverse  environmental  conditions.  For  this  reason,  a  number  of  ATR  research 
projects  are  actively  carried  out  by  the  U.S.  Army  Research  Laboratory.  In 
the  long  run,  these  efforts  could  significantly  contribute  to  the  safety  of  our 
troops  and  the  security  of  our  nation. 

In this report, we propose an ATR classifier that consists of several aspect
windows, an extraction enlargement procedure, a wavelet decomposition,
and a set of product code VQs. Background noise and effective dimensionality
are greatly reduced by the variable-size aspect windows. As shown by
the performance of the no-enlargement method, the extraction enlargement
is necessary to achieve higher recognition rates with a more compatible
similarity measure between the aspect windows. Wavelet decomposition of the
enlarged extraction subdivides the complexity of the recognition task,
extracts target features at different perspectives, and enables the use of a
product code VQ. The K-means algorithm and a modified LVQ algorithm have
been used to capture intra-target similarities and to enhance inter-target
discriminability. We tested three variants of the proposed ATR classifier in
order to examine the effects of the extraction enlargement procedure, the
wavelet decomposition, and the product code VQ. Each of these three procedures
was one component of the proposed method. We constructed each
variant by omitting one of the three procedures. It was found that all
three variants resulted in recognition performance that was worse than the
proposed method; therefore, each of the stages is critical to the proposed
method.